perf: generic literal-prefix fast path for regexp_replace ${1} #22354
Dandandan wants to merge 1 commit into
Generalizes the existing `^...(capture).*$` -> `${1}` extraction in
`OptimizedRegex` for the common subset where the regex up to the
capture reduces to a finite set of literal byte prefixes and the
capture has the form `([^X]+)X` for a single ASCII byte X.
For inputs of that shape:
- `^https?://(?:www\.)?([^/]+)/.*$` (ClickBench Q28)
- `^foo:([^,]+),.*$` (single literal prefix)
- `^(?:foo|bar|baz):([^/]+)/.*$` (alternation prefix)
the recognizer parses the pattern's HIR once via `regex-syntax`,
enumerates the literal prefix variants (bounded by 32 alternatives),
and dispatches each row to a `memchr`-based extractor instead of the
regex engine.
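The enumeration step can be sketched as follows. This is a minimal illustration, not the PR's code: the `Prefix` enum is a hypothetical, hand-built stand-in for the `regex-syntax` HIR nodes the real recognizer walks, and `enumerate` and `literal_prefixes` are made-up names. Only the bound of 32 and the deduplicated longest-first ordering come from the description; the lexicographic tie-break is an assumption.

```rust
/// Hypothetical, simplified stand-in for the pattern nodes that may appear
/// before the capture group. The real recognizer walks `regex-syntax` HIR.
enum Prefix {
    Literal(&'static str),
    Optional(Box<Prefix>),
    Alternation(Vec<Prefix>),
    Concat(Vec<Prefix>),
    /// Anything like `.*` or `a+` that has no finite set of literals.
    Unbounded,
}

const MAX_PREFIX_VARIANTS: usize = 32;

/// Returns every literal string the node can match, or `None` if the node
/// is unbounded or the variant count would exceed `MAX_PREFIX_VARIANTS`.
fn enumerate(node: &Prefix) -> Option<Vec<String>> {
    let variants = match node {
        Prefix::Literal(s) => vec![s.to_string()],
        Prefix::Optional(inner) => {
            // Either the inner literals or nothing at all.
            let mut v = enumerate(inner)?;
            v.push(String::new());
            v
        }
        Prefix::Alternation(branches) => {
            let mut v = Vec::new();
            for branch in branches {
                v.extend(enumerate(branch)?);
            }
            v
        }
        Prefix::Concat(parts) => {
            // Cross product of the variants of each part, checked against
            // the bound before it is materialized.
            let mut acc = vec![String::new()];
            for part in parts {
                let next = enumerate(part)?;
                if acc.len() * next.len() > MAX_PREFIX_VARIANTS {
                    return None;
                }
                acc = acc
                    .iter()
                    .flat_map(|a| next.iter().map(move |n| format!("{}{}", a, n)))
                    .collect();
            }
            acc
        }
        Prefix::Unbounded => return None,
    };
    if variants.len() <= MAX_PREFIX_VARIANTS {
        Some(variants)
    } else {
        None
    }
}

/// Deduplicated, longest-first list of literal prefixes.
/// Ties are broken lexicographically (an assumption of this sketch).
fn literal_prefixes(node: &Prefix) -> Option<Vec<String>> {
    let mut v = enumerate(node)?;
    v.sort_by(|a, b| b.len().cmp(&a.len()).then_with(|| a.cmp(b)));
    v.dedup();
    Some(v)
}
```

For the Q28 prefix `https?://(?:www\.)?`, modelled as a `Concat` of `http`, an optional `s`, `://`, and an optional `www.`, this yields `https://www.`, `http://www.`, `https://`, `http://`. A `None` result corresponds to falling back to the regex path.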
Longest-matching-prefix is tried first; on empty-capture failure the
extractor falls back to shorter prefixes. That preserves the regex's
backtracking semantics for cases like `http://www./path` against the
Q28 pattern, where the full regex prefers to leave `www.` outside the
optional so the capture is non-empty.
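A rough sketch of the per-row extraction with longest-first fallback, under stated assumptions: `find_byte` is a std-only stand-in for `memchr::memchr` so the example has no dependencies, `extract` is a hypothetical name, and the newline check on the tail reflects how `.*$` behaves in the `regex` crate by default (`.` does not match `\n`) rather than anything confirmed about the PR's code.

```rust
/// Std-only stand-in for `memchr::memchr`; the real crate uses an
/// accelerated byte search.
fn find_byte(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// Returns what `^(?:P1|P2|...)([^X]+)X.*$` would capture, where `prefixes`
/// is sorted longest-first and `terminator` is the ASCII byte `X`.
/// Returns `None` when no prefix leads to a match.
fn extract<'a>(input: &'a str, prefixes: &[&str], terminator: u8) -> Option<&'a str> {
    debug_assert!(terminator.is_ascii());
    for &prefix in prefixes {
        let rest = match input.strip_prefix(prefix) {
            Some(rest) => rest,
            None => continue,
        };
        let bytes = rest.as_bytes();
        match find_byte(terminator, bytes) {
            // The capture must be non-empty (`+`), and the tail after the
            // terminator must be consumable by `.*$`, so it may not
            // contain a newline.
            Some(end) if end > 0 && !bytes[end + 1..].contains(&b'\n') => {
                // `end` indexes an ASCII byte, so it is a char boundary.
                return Some(&rest[..end]);
            }
            // Empty capture, missing terminator, or newline in the tail:
            // fall back to the next, shorter prefix.
            _ => continue,
        }
    }
    None
}
```

With the Q28 prefixes `["https://www.", "http://www.", "https://", "http://"]` and terminator `b'/'`, the input `http://www./path` fails on `http://www.` (empty capture) and then succeeds on `http://`, giving `www.`. When `extract` returns `None`, the regex would not have matched, and `regexp_replace` leaves the row unchanged.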
Patterns that don't match the literal-prefix shape continue through the
existing `ShortenedRegex` path (strip trailing `.*$`, use
`captures_read` against reusable `CaptureLocations`). Recognition is
strict — `(?i)`, `(?m)`, non-ASCII terminators, and unbounded prefix
constructs all fall back.
Measured on ClickBench Q28
(`REGEXP_REPLACE("Referer", '^https?://(?:www\.)?([^/]+)/.*$', '\1')`,
partitioned dataset, dfbench `--iterations 5 --query 28`, same machine):
| Build | Avg ms |
| ------------------------------------ | -------: |
| Upstream main (shortened-regex only) | 4577.52 |
| Literal-prefix fast path | 2225.00 |
Delta: -51.4% on Q28.
Validation:
- cargo test -p datafusion-functions --lib regex::regexpreplace
- cargo clippy -p datafusion-functions --all-targets --all-features -- -D warnings
run benchmarks
🤖 Benchmark runs (GKE) comparing perf/generic-regexpreplace-prefix-capture (b140b80) to c8b784a (merge-base) completed for clickbench_partitioned, tpch, and tpcds.
nice, there is still quite some potential
Which issue does this PR close?
Rationale for this change
`OptimizedRegex` (introduced for ClickBench Q28-style patterns) already handles `^...(capture).*$` → `${1}` by stripping the trailing `.*$` and running the shortened regex with reusable `CaptureLocations`. That avoids `expand()` and string allocation, but still pays for full regex matching on every row.

For the common subset where the regex up to the capture reduces to a finite set of literal byte prefixes and the capture has the form `([^X]+)X` for a single ASCII byte `X`, we can do strictly better: parse the pattern's HIR once at planning time, enumerate the literal prefix variants, and dispatch each row to a `memchr`-based extractor, with no regex engine involvement per row.

Patterns that fit this shape:

| Pattern | Literal prefixes | Terminator |
| --- | --- | --- |
| `^https?://(?:www\.)?([^/]+)/.*$` (ClickBench Q28) | `https://www.`, `http://www.`, `https://`, `http://` | `/` |
| `^foo:([^,]+),.*$` | `foo:` | `,` |
| `^(?:foo\|bar\|baz):([^/]+)/.*$` | `bar:`, `baz:`, `foo:` | `/` |

Any pattern that doesn't fit (case-insensitive, multiline, non-ASCII terminator, unbounded prefix) falls back to the existing `ShortenedRegex` path with no change.

Performance
Measured on ClickBench Q28 (`REGEXP_REPLACE("Referer", '^https?://(?:www\.)?([^/]+)/.*$', '\1')`, partitioned dataset, `dfbench --iterations 5 --query 28`, same machine):
Per-iteration:
What changes are included in this PR?
- `LiteralPrefixCaptureSpec` variant of the internal `ShortRegex` enum, holding a deduplicated longest-first list of literal prefixes plus a single-byte terminator.
- `try_recognize_literal_prefix_capture(pattern)` parses the pattern with `regex-syntax::parse`, walks the HIR, enumerates prefix variants (bounded by `MAX_PREFIX_VARIANTS = 32`), and verifies the capture is greedy `[^X]+` followed by a literal `X`, `.*`, and `$`.
- The extractor tries the longest matching prefix first and, on empty-capture failure, falls back to shorter prefixes. That preserves the regex's backtracking semantics for cases like `http://www./path` against the Q28 pattern, where the regex prefers to leave `www.` outside the optional so the capture is non-empty.
- `regex-syntax` dependency (already pulled in via `regex`, version pinned to `0.8`). Wired through the `regex_expressions` feature so it's only compiled when `regex` is.

Are these changes tested?
New unit tests in `datafusion-functions::regex::regexpreplace::tests`:
- `literal_prefix_recognizer_accepts_clickbench_q28`: exact prefix list and terminator for the Q28 pattern.
- `literal_prefix_recognizer_accepts_single_literal` and `_accepts_alternation`: basic shapes.
- `literal_prefix_recognizer_rejects_non_anchored`, `_rejects_unbounded_prefix`, `_rejects_non_ascii_terminator`, `_rejects_case_insensitive`: guardrails on what the recognizer will and won't accept.
- `literal_prefix_fast_path_matches_full_regex_for_q28_pattern` and `_for_alternation_pattern`: differential tests that run the optimized path and the full `regex::Regex` on the same inputs (including edge cases like `http://www./path`, embedded `\n`, empty captures, and non-matching inputs) and assert byte-equal output.

All 22 tests in this module pass (13 pre-existing + 9 new).
Are there any user-facing changes?
No. `regexp_replace` semantics are unchanged: any input that wouldn't go through the fast path follows the exact same code as before, and inputs that do go through the fast path are differentially tested against the full regex.